Group Members:
Aakash Shetty
Amit Phadke
Pratik Patil
Saket Tulsan
Vatsal Gandhi
Car accidents happen in New York City at an alarming rate, and thousands of people suffer personal injuries every day. According to the New York City Police Department, there were 228,047 car accidents citywide in 2018. That breaks down to roughly:
19,004 car accidents per month, 4,386 per week, 625 per day, 26 per hour, or about one car accident every 2.3 minutes.
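As a sanity check, the per-period figures can be derived directly from the annual total:

```python
# Derive the per-period breakdown from the 2018 annual total
annual = 228_047
per_month = round(annual / 12)          # ~19,004
per_week = round(annual / 52)           # ~4,386
per_day = round(annual / 365)           # ~625
per_hour = round(annual / (365 * 24))   # ~26
minutes_between = round(365 * 24 * 60 / annual, 1)  # ~2.3
print(per_month, per_week, per_day, per_hour, minutes_between)
```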
The questions that arise from this are: Which locations in New York City have the most accidents? What were the causes? Did most accidents occur during rush hour or in the midnight hours?
In the analysis given below we try to analyze the different factors that affect these accidents.
Below you will find the implementation of a few processes we have done for analysis. You can jump to the sections:
1. Data Cleaning
2. Exploratory Data Analysis
3. Statistics and Machine Learning
First we will import libraries such as numpy, scipy and matplotlib to manipulate, analyze and visualize our data. The second setup task is importing our dataset from a CSV file into the notebook, where it is read into a DataFrame.
#importing libraries
import os
import numpy as np #importing numpy array as np
import pandas as pd #importing pandas library as pd
import scipy as sc #importing scipy as sc
from statsmodels.formula.api import ols
import matplotlib.pyplot as plt #importing matplotlib as plt
import plotly.graph_objs as go
import seaborn as sns
import statsmodels.api as sm
sns.set(style='ticks', context='talk')
pd.options.display.max_rows = 20
pd.options.display.max_columns=55
#read_csv reads data from a CSV (comma-separated values) file and returns a DataFrame
table = pd.read_csv("nypd-motor-vehicle-collisions.csv")
table.head()
The first step here is cleaning our data. We will perform operations such as getting the data into a standard format, handling null values, and removing unnecessary columns or values.
#counting null values
table.isna().sum()
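On a toy frame (illustrative values, not the real dataset), `isna().sum()` reports the null count per column:

```python
import pandas as pd

# A tiny frame with deliberately missing values
df = pd.DataFrame({'BOROUGH': ['QUEENS', None, 'BRONX'],
                   'ZIP_CODE': [11101, None, None]})
print(df.isna().sum().to_dict())  # one null borough, two null zip codes
```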
#getting date and time into a standard format of 'yyyy-mm-dd' and 'hh:mm' respectively
dateparse = lambda x: pd.to_datetime(x, format='%Y-%m-%dT%H:%M:%S.%f')
table = pd.read_csv("nypd-motor-vehicle-collisions.csv", parse_dates=['DATE'], date_parser=dateparse)
#we will drop the street and location information since we already have columns for borough, latitude and longitude
table.drop(table.columns[6:10],axis=1,inplace=True)
#Since the focus of our analysis is not vehicle types, we will be omitting the vehicle-type columns
table.drop(table.columns[20:25],axis=1,inplace=True)
table.drop(table.columns[8:14],axis=1,inplace=True)
table.head()
#Here we replace the blank spaces in the column headings with an underscore, as is the standard format.
cols = table.columns
cols = cols.map(lambda x: x.replace(' ', '_') if isinstance(x, (str)) else x)
table.columns = cols
table.head()
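The mapping above can be sanity-checked on a small Index of illustrative column names:

```python
import pandas as pd

# Illustrative column names with spaces, as in the raw CSV
cols = pd.Index(['ZIP CODE', 'ON STREET NAME'])
renamed = list(cols.map(lambda x: x.replace(' ', '_')))
print(renamed)  # ['ZIP_CODE', 'ON_STREET_NAME']
```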
#Blank borough values will be filled with the string 'Unspecified'
table.BOROUGH = table.BOROUGH.fillna('Unspecified')
#Blank zip codes will be filled with the value '0'
table.ZIP_CODE = table.ZIP_CODE.fillna('0')
#If a vehicle's contributing factor is not known, mark that row 'Unspecified'
factor_cols = ['CONTRIBUTING_FACTOR_VEHICLE_%d' % i for i in range(1, 6)]
table[factor_cols] = table[factor_cols].fillna('Unspecified')
#table.VEHICLE_TYPE_CODE_1=table.VEHICLE_TYPE_CODE_1.fillna('Unspecified')
table.head()
#In a few cases the latitude and longitude values are blank. We fill these with 0 so the columns stay numeric.
table.LATITUDE = table.LATITUDE.fillna(0.0)
table.LONGITUDE = table.LONGITUDE.fillna(0.0)
table.head()
#Here we are standardizing the different contributing-factor labels (same spelling fixes and groupings for each factor column)
common_map = {
    "Illnes": "Illness",
    "Drugs (illegal)": "Drugs (Illegal)",
    "Texting": "Cell Phone",
    "Cell Phone (hands-free)": "Cell Phone",
    "Cell Phone (hand-held)": "Cell Phone",
    "Cell Phone (hand-Held)": "Cell Phone",
    "Listening/Using Headphones": "Cell Phone",
    "Using On Board Navigation Device": "Devices",
    "Other Electronic Device": "Devices",
    "Shoulders Defective/Improper": "Physical Disability",
    "Reaction to Other Uninvolved Vehicle": "Reaction to Uninvolved Vehicle",
}
table['CONTRIBUTING_FACTOR_VEHICLE_1'] = table.CONTRIBUTING_FACTOR_VEHICLE_1.replace(common_map)
#Factor columns 2 and 3 additionally fold 'Other Vehicular' into 'Reaction to Uninvolved Vehicle'
extra_map = dict(common_map, **{"Other Vehicular": "Reaction to Uninvolved Vehicle"})
table['CONTRIBUTING_FACTOR_VEHICLE_2'] = table.CONTRIBUTING_FACTOR_VEHICLE_2.replace(extra_map)
table['CONTRIBUTING_FACTOR_VEHICLE_3'] = table.CONTRIBUTING_FACTOR_VEHICLE_3.replace(extra_map)
#Factor column 4 only needs the spelling fixes
table['CONTRIBUTING_FACTOR_VEHICLE_4'] = table.CONTRIBUTING_FACTOR_VEHICLE_4.replace(
    {"Illnes": "Illness", "Drugs (illegal)": "Drugs (Illegal)"})
table.head()
table['Hour'] = pd.to_datetime(table['TIME']).dt.hour
table.head()
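The `dt.hour` accessor above pulls the hour out of parsed timestamps; a minimal sketch on toy TIME strings:

```python
import pandas as pd

# Toy TIME strings parsed to datetimes, then the hour extracted
times = pd.Series(['00:05', '08:30', '16:45', '23:59'])
hours = list(pd.to_datetime(times, format='%H:%M').dt.hour)
print(hours)  # [0, 8, 16, 23]
```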
Exploratory Data Analysis, or EDA, is an approach to analyzing a dataset to summarize its main characteristics, often with visual methods. For the dataset above we have explored the attributes using appropriate graphical models. This helps us understand the nature and behavior of our data. In the sections below we analyze the data to answer questions like why, where and when these collisions occur and how many people are affected.
We have divided our EDA in three different parts
Analysis of Collisions based on Contributing Factor
Analysis of collisions based on boroughs
Analysis of collisions based on different time periods
Another important part of the analysis is exploring why these collisions occur. The contributing-factor columns give a wide variety of reasons that result in these unfortunate events. Our analysis found a wide spectrum of causal factors, some far more frequent than others. The graph below shows the top 10 factors causing collisions in New York City.
Contributing_factor1= table['CONTRIBUTING_FACTOR_VEHICLE_1'].value_counts()
Contributing_factor2= table['CONTRIBUTING_FACTOR_VEHICLE_2'].value_counts()
Contributing_factor3= table['CONTRIBUTING_FACTOR_VEHICLE_3'].value_counts()
Contributing_factor4= table['CONTRIBUTING_FACTOR_VEHICLE_4'].value_counts()
Contributing_factor5= table['CONTRIBUTING_FACTOR_VEHICLE_5'].value_counts()
Contributing_factor = Contributing_factor1+Contributing_factor2+Contributing_factor3+Contributing_factor4+Contributing_factor5
Contributing_factor = Contributing_factor.sort_values(ascending=False).dropna()
Contributing_factor=Contributing_factor.drop(['Unspecified'])
congif = Contributing_factor.head(10)
#Plotting a Bar diagram to explore how causal factors are represented in the data set
congif.plot(kind='barh',title='Top 10 Contributing factors to Motor Vehicle Collisions', figsize=(10,10)).invert_yaxis()
plt.axhline(len(Contributing_factor)-18.5, color='#CC0000')
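A subtlety in the summation above: adding `value_counts` Series aligns on the index, so a factor missing from any one column becomes NaN (which the `dropna()` then removes). A toy illustration:

```python
import math
import pandas as pd

# Counts of the same factor across two columns, with one factor missing from b
a = pd.Series({'Driver Inattention/Distraction': 5, 'Illness': 2})
b = pd.Series({'Driver Inattention/Distraction': 3})
print((a + b).to_dict())                  # 'Illness' turns into NaN
print(a.add(b, fill_value=0).to_dict())   # keeps 'Illness' with count 2.0
```

If factors appearing in only some columns should be kept, `Series.add(other, fill_value=0)` sums them instead of producing NaN.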
From the above graph we can see that the greatest number of collisions occur due to driver inattention or distraction.
New York City is divided into 5 boroughs. We wanted to see how many collisions occur in each borough of NYC and how many people are injured or killed in these collisions.
table_borough = table.groupby(table.BOROUGH).sum()[['NUMBER_OF_PERSONS_INJURED','NUMBER_OF_PERSONS_KILLED']].join(table.groupby(table.BOROUGH).count()['COLLISION_ID'])
table_borough=table_borough.drop(['Unspecified'])
table_borough
#BOROUGH wise classification of number of collisions
table2=table['BOROUGH'].value_counts()
table2=table2.drop(['Unspecified'])
table2.plot.barh(title='Borough wise number of collisions reported', color="red").invert_yaxis()
#table2.plot.barh(title='Borough wise number of collisions reported', color="red")
plt.xlabel('No. of reported collisions', fontsize=18)
plt.ylabel('Boroughs', fontsize=18)
table_borough['NUMBER_OF_PERSONS_INJURED'].plot(kind='bar', title='Borough wise classification', fontsize=15)
plt.legend(loc="upper right", prop={'size': 10})
plt.ylabel('Count of people injured', fontsize=18)
table_borough['NUMBER_OF_PERSONS_KILLED'].plot(kind='bar',
title='Borough wise classification', fontsize=15)
plt.legend(loc="upper right", prop={'size': 10})
plt.ylabel('Count of people killed', fontsize=18)
Here we can see that Brooklyn has the highest number of people killed and injured while Staten Island has the least. The highest number of reported collisions also occurs in Brooklyn, followed by Queens and Manhattan, with Staten Island having the fewest reported collisions.
In the section below we explore our data based on day, month, time, etc. We will also see how these changes in time period affect the collision rate and the number of persons killed or injured.
First we will analyze the number of people killed or injured based on all the months in a year.
table['YEAR_DATE'] = pd.to_datetime(table['DATE'])
table['YEAR']= (table['YEAR_DATE']).dt.year
#code for creating a table that shows number of people killed each month.
table['Months'] = pd.to_datetime(table['DATE'])
table_month=table.groupby(table.Months.dt.month).sum()[['NUMBER_OF_PERSONS_INJURED','NUMBER_OF_PERSONS_KILLED']].join(table.groupby(table.Months.dt.month).count()['COLLISION_ID'])
table_month=table_month.reset_index()
table_month=table_month.rename({0: 'Jan', 1: 'Feb',2: 'Mar', 3: 'Apr',4: 'May', 5: 'Jun',6: 'Jul', 7: 'Aug',8: 'Sep', 9: 'Oct',10: 'Nov', 11: 'Dec'})
table_month=table_month.drop(['Months'],axis=1)
table_month
#plotting the table
table_month['COLLISION_ID'].plot.bar(title='Month wise classification', fontsize = 15)
plt.legend(loc="best", prop={'size': 10})
plt.ylabel('No. of Collisions', fontsize=18)
plt.xlabel('Month', fontsize=18)
#plotting the table
table_month['NUMBER_OF_PERSONS_INJURED'].plot.bar(title='Month wise classification', fontsize = 15)
plt.legend(loc="best", prop={'size': 10})
plt.ylabel('No. of people injured', fontsize=18)
plt.xlabel('Month', fontsize=18)
table_month['NUMBER_OF_PERSONS_KILLED'].plot.bar(title='Month wise classification',color="green")
plt.legend(loc="best", prop={'size': 10})
plt.ylabel('No. of people killed', fontsize=18)
plt.xlabel('Month', fontsize=18)
From the above charts we can see that the month of August is the most dangerous given its collision, injury and death rates.
After the per-month analysis, a good next step is to analyze the different days of the week. This lets us compare the nature of collisions on weekends and weekdays.
table['WEEKDAY'] = pd.to_datetime(table['DATE'])
table_weekday=table.groupby(table.WEEKDAY.dt.day_name()).sum()[['NUMBER_OF_PERSONS_INJURED','NUMBER_OF_PERSONS_KILLED']].join(table.groupby(table.WEEKDAY.dt.day_name()).count()['COLLISION_ID'])
table_weekday
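One caveat: grouping by day name orders the index alphabetically, so Friday sorts first. If calendar order is wanted for the plots, the result can be reindexed; a toy sketch:

```python
import pandas as pd

# Toy counts keyed by day name, reordered into calendar order
order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
counts = pd.Series({'Friday': 40, 'Monday': 31, 'Sunday': 22, 'Wednesday': 35})
ordered = counts.reindex(order).dropna()
print(list(ordered.index))  # ['Monday', 'Wednesday', 'Friday', 'Sunday']
```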
#table_weekday['NUMBER_OF_PERSONS_INJURED'].plot(kind='bar', title='Weekday wise classification', fontsize=15, color = 'green')
table_weekday['COLLISION_ID'].plot(title='Weekday wise classification', fontsize=10, color = 'green')
plt.legend(loc="upper right", prop={'size': 10})
plt.ylabel('No. of Collisions', fontsize=14)
plt.xlabel('Days', fontsize=14)
#table_weekday['NUMBER_OF_PERSONS_INJURED'].plot(kind='bar', title='Weekday wise classification', fontsize=15, color = 'green')
table_weekday['NUMBER_OF_PERSONS_INJURED'].plot(title='Weekday wise classification', fontsize=10, color = 'green')
plt.legend(loc="upper right", prop={'size': 10})
plt.ylabel('No. of people injured', fontsize=14)
plt.xlabel('Days', fontsize=14)
#table_weekday['NUMBER_OF_PERSONS_INJURED'].plot(kind='bar', title='Weekday wise classification', fontsize=15, color = 'green')
table_weekday['NUMBER_OF_PERSONS_KILLED'].plot(title='Weekday wise classification', fontsize=10, color = 'green')
plt.legend(loc="upper right", prop={'size': 10})
plt.ylabel('No. of People Killed', fontsize=14)
plt.xlabel('Days', fontsize=14)
As shown by the above graphs, most accidents happen on Fridays.
Knowing exactly when these collisions happen can give us great insight into accident analysis.
table['HOUR'] = pd.to_datetime(table['TIME'])
table_hour=table.groupby(table.HOUR.dt.hour).sum()[['NUMBER_OF_PERSONS_INJURED','NUMBER_OF_PERSONS_KILLED']].join(table.groupby(table.HOUR.dt.hour).count()['COLLISION_ID'])
table_hour
#Hour wise classification of number of collisions
table['TIME'] = pd.to_datetime(table['TIME'])
ax = sns.countplot(x=table.TIME.dt.hour, data=table)
table_hour['NUMBER_OF_PERSONS_INJURED'].plot.bar(title='Hour wise classification of NUMBER_OF_PERSONS_INJURED')
plt.ylabel('No. of people injured', fontsize=18)
table_hour['NUMBER_OF_PERSONS_KILLED'].plot.bar(title='Hour wise classification of NUMBER_OF_PERSONS_KILLED',color="green")
plt.ylabel('No. of people Killed', fontsize=18)
Here we can see that most collisions, injuries and deaths occur between 16:00 and 18:00. The same is true of the morning hours around 8:00 am. This can be explained by peak rush-hour traffic during these times. A small irregularity in the data can be seen in the third graph, where the number of people killed is high at 4 pm.
All of the above charts used univariate analysis, that is, analysis of a single variable. Below we explore the data with multivariate analysis, tracking injuries, deaths and collisions by hour of day and day of week.
plt.figure(figsize=(18, 8))
table['WEEKDAY'] = pd.to_datetime(table['DATE'])
table['HOUR'] = pd.to_datetime(table['TIME'])
weekday_order = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
dayofweek = table.groupby([table.WEEKDAY.dt.day_name(), table.HOUR.dt.hour])['NUMBER_OF_PERSONS_KILLED'].sum().unstack().T
sns.heatmap(dayofweek[weekday_order])
plt.xticks(np.arange(7) + .5, ('SUN', 'MON', 'TUE', 'WED', 'THU', 'FRI', 'SAT'))
#plt.yticks(rotation=0)
plt.ylabel('Hour of Day\n', size=18)
plt.xlabel('\nDay of the Week', size=18)
plt.yticks(rotation=0, size=12)
plt.xticks(rotation=0, size=12)
plt.title("Number of Persons killed Over time\n", size=18, );
plt.figure(figsize=(18, 8))
table['WEEKDAY'] = pd.to_datetime(table['DATE'])
table['HOUR'] = pd.to_datetime(table['TIME'])
weekday_order = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
dayofweek = table.groupby([table.WEEKDAY.dt.day_name(), table.HOUR.dt.hour])['NUMBER_OF_PERSONS_INJURED'].sum().unstack().T
sns.heatmap(dayofweek[weekday_order])
plt.xticks(np.arange(7) + .5, ('SUN', 'MON', 'TUE', 'WED', 'THU', 'FRI', 'SAT'))
#plt.yticks(rotation=0)
plt.ylabel('Hour of Day\n', size=18)
plt.xlabel('\nDay of the Week', size=18)
plt.yticks(rotation=0, size=12)
plt.xticks(rotation=0, size=12)
plt.title("Number of Persons Injured Over time\n", size=18, );
In the analysis below we explore our most recent data, that of 2018.
df_loc = table[['DATE','LATITUDE','LONGITUDE','NUMBER_OF_PERSONS_INJURED','BOROUGH','NUMBER_OF_PERSONS_KILLED']]
currentyear = 2018
df_loc = df_loc[pd.DatetimeIndex(df_loc.DATE).year == currentyear]
df_loc=df_loc[df_loc.BOROUGH != 'Unspecified']
df_loc = df_loc[df_loc.NUMBER_OF_PERSONS_INJURED > 0]
df_loc.head()
#Access token from Plotly
mapbox_access_token = 'pk.eyJ1Ijoia3Jwb3BraW4iLCJhIjoiY2pzcXN1eDBuMGZrNjQ5cnp1bzViZWJidiJ9.ReBalb28P1FCTWhmYBnCtA'
#Prepare data for Plotly
data = [
    go.Scattermapbox(
        lat=df_loc.LATITUDE,
        lon=df_loc.LONGITUDE,
        mode='markers',
        text=df_loc[['BOROUGH', 'NUMBER_OF_PERSONS_INJURED']],
        marker=dict(
            size=7,
            color=df_loc.NUMBER_OF_PERSONS_INJURED,
            colorscale='RdBu',
            reversescale=True,
            colorbar=dict(
                title='NUMBER_OF_PERSONS_INJURED'
            )
        ),
    )
]
#Prepare layout for Plotly
layout = go.Layout(
    autosize=True,
    hovermode='closest',
    title='NYPD Motor Vehicle Collisions in ' + str(currentyear),
    mapbox=dict(
        accesstoken=mapbox_access_token,
        bearing=0,
        center=dict(
            lat=40.721319,
            lon=-73.987130
        ),
        pitch=0,
        zoom=11
    ),
)
from plotly.offline import init_notebook_mode, iplot
#Create map using Plotly
fig = dict(data=data, layout=layout)
iplot(fig, filename='NYPD Collisions')
df_loc1 = df_loc[df_loc.NUMBER_OF_PERSONS_KILLED > 0]
df_loc1.head()
#Access token from Plotly
mapbox_access_token = 'pk.eyJ1Ijoia3Jwb3BraW4iLCJhIjoiY2pzcXN1eDBuMGZrNjQ5cnp1bzViZWJidiJ9.ReBalb28P1FCTWhmYBnCtA'
#Prepare data for Plotly
data = [
    go.Scattermapbox(
        lat=df_loc1.LATITUDE,
        lon=df_loc1.LONGITUDE,
        mode='markers',
        text=df_loc1[['BOROUGH', 'NUMBER_OF_PERSONS_KILLED']],
        marker=dict(
            size=7,
            color=df_loc1.NUMBER_OF_PERSONS_KILLED,
            colorscale='RdBu',
            reversescale=True,
            colorbar=dict(
                title='NUMBER_OF_PERSONS_KILLED'
            )
        ),
    )
]
#Prepare layout for Plotly
layout = go.Layout(
    autosize=True,
    hovermode='closest',
    title='NYPD Motor Vehicle Collisions in ' + str(currentyear),
    mapbox=dict(
        accesstoken=mapbox_access_token,
        bearing=0,
        center=dict(
            lat=40.721319,
            lon=-73.987130
        ),
        pitch=0,
        zoom=11
    ),
)
from plotly.offline import init_notebook_mode, iplot
#Create map using Plotly
fig = dict(data=data, layout=layout)
iplot(fig, filename='NYPD Collisions')
table_reg=table.replace('Unspecified'," ")
table_reg['Months_number'] = table_reg.Months.dt.month
table_reg.head()
# create a fitted model with all three features
#lm = ols(formula=' NUMBER_OF_PERSONS_INJURED ~ BOROUGH + CONTRIBUTING_FACTOR_VEHICLE_1 + Hour ', data=table_reg).fit()
lm = ols(formula=' NUMBER_OF_PERSONS_INJURED ~ Hour + Months_number', data=table_reg).fit()
# print a summary of the fitted model
lm.summary()
Our F-statistic is 219.9, which is significant, and the p-values are close to zero. Hence we can say that the number of persons injured is related to the month and the hour of the day.
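These quantities can also be read programmatically from the fitted results object (`f_pvalue`, `pvalues`); a toy sketch on synthetic data, not the collision dataset:

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data with a strong linear relationship between x and y
rng = np.random.default_rng(0)
demo = pd.DataFrame({'x': np.arange(100.0)})
demo['y'] = 2.0 * demo['x'] + rng.normal(0, 1, 100)

fit = ols('y ~ x', data=demo).fit()
# Both the overall F-test and the coefficient p-value are significant
print(fit.f_pvalue < 0.05, fit.pvalues['x'] < 0.05)
```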
fig, ax = plt.subplots(figsize=(12, 8))
fig = sm.graphics.plot_fit(lm, "Hour", ax=ax)
fig, ax = plt.subplots(figsize=(12, 8))
fig = sm.graphics.plot_fit(lm, "Months_number", ax=ax)
X = np.array([0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22])
X
#Number of people injured at the even hours, taken from table_hour above
injured_hourly = table_hour['NUMBER_OF_PERSONS_INJURED'].values
Y = np.array([22020, 12319, 13920, 22495, 39349, 33845, 40408, 51471, 58053, 50719, 39248, 31100])
plt.grid(True)
plt.scatter(X,Y)
plt.title('')
plt.xlabel("Hours")
plt.ylabel("NUMBER_OF_PEOPLE_INJURED")
plt.show()
#Fit polynomials of degree 1, 2 and 3; the degree-3 polynomial gives the curve of best fit
p1 = np.polyfit(X, Y, 1)
p2 = np.polyfit(X, Y, 2)
p3 = np.polyfit(X, Y, 3)
plt.grid(True)
plt.scatter(X,Y)
plt.xlabel("Hours")
plt.ylabel("NUMBER_OF_PEOPLE_INJURED")
plt.plot(X,np.polyval(p1,X), 'r--', label ='p1')
plt.plot(X,np.polyval(p2,X), 'm:', label='p2')
plt.plot(X,np.polyval(p3,X), 'g--', label = 'p3')
plt.legend(loc="lower right")
plt.show()
np.polyfit(X,Y,3)
#Predict the value at X = 14 (2 PM)
y_fit_14 = np.polyval(p3, 14)
print("Fitted value at 2 PM:", y_fit_14)
Our predicted value at 2 PM is about 50,844 against the actual value of 51,471, which is quite close.
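`np.polyval` simply evaluates the fitted coefficients at a new point; on noiseless synthetic data a degree-3 fit recovers the generating cubic exactly, which makes the mechanics easy to verify:

```python
import numpy as np

# An exact cubic with no noise: the fit should recover it
x = np.arange(10, dtype=float)
y = x**3 - 2 * x + 5
p = np.polyfit(x, y, 3)
pred = round(float(np.polyval(p, 14.0)))
print(pred)  # 14**3 - 2*14 + 5 = 2721
```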
x_fit=np.arange(24)
p3 = np.polyfit(X,Y,3)
y_fit = []
for i in range(len(x_fit)):
y_fit.append(np.polyval(p3, i))
plt.grid(True)
plt.plot(X, Y, label = "Original")
plt.plot(x_fit, y_fit, label = "Fitted")
plt.xlabel("Hours")
plt.ylabel("NUMBER_OF_PEOPLE_INJURED")
plt.legend(loc="lower right")
plt.show()
table_reg['Hour'] = table_reg['Hour'].fillna(0)  # fillna returns a copy, so assign the result back
table_reg.dropna(inplace=True)
table_reg = table_reg.loc[~(table_reg.NUMBER_OF_PERSONS_INJURED==0)]
table_reg = table_reg.loc[~(table_reg.Hour==0)]
X=table_reg[['Hour','Months_number']]
Y=table_reg['NUMBER_OF_PERSONS_INJURED']
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn import metrics
model = LinearRegression()
rfe = RFE(model, n_features_to_select=2)  # only two predictor columns are available
rfe.fit(X, Y)
y_pred = rfe.predict(X)
print("Support:",rfe.support_)
print("Ranking",rfe.ranking_)
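With only two predictors, RFE has little to eliminate, so the support and ranking output above are trivial. On synthetic data with irrelevant features (a toy sketch, not the collision data), the ranking becomes informative:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: only the first two of four features drive the target
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 4))
y_demo = 3 * X_demo[:, 0] + 2 * X_demo[:, 1]

rfe_demo = RFE(LinearRegression(), n_features_to_select=2).fit(X_demo, y_demo)
print(rfe_demo.support_)  # the two informative features are kept
```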
Based on the top contributing factors and the times and locations of collisions, measures such as stricter patrolling and driver education could help curb them.
The dataset can be further analyzed to obtain more meaningful insights.